QMSS5015 Lab 2¶
October 12, 2024¶
Sam Anwar¶
The dataset: I'm continuing with the same dataset I used for Lab 1 -- 2015 NYC Yellow Taxi rides.
I am pulling the data in via API access to the City of New York's Open Data repository because the full file (over 14 million rows) is too large to store locally. I cap the request at 50,000 rows, and the API does not guarantee a stable row order, so the exact data you see may differ from the exact data I pulled. I tried to keep observations as general as possible with this in mind, but you may see different descriptive statistics than the ones I write about in these markdown cells if you re-run this with a new kernel.
I use the Plotly library for my visualizations. They're all interactive! Please explore the data dynamically.
Data source: https://data.cityofnewyork.us/Transportation/2015-Yellow-Taxi-Trip-Data/2yzn-sicd/about_data
# import all necessary libraries
import pandas as pd
import plotly.express as px
import json
import geopandas as gpd
from shapely.geometry import Point
Question 1: Recode 2 different variables into new categories. They can both be continuous-ish or both be nominal-ish, or one of each. Tell me what you did and explain the variable(s).¶
For the first variable, I want to recode pick up lat/longs into neighborhoods. This will involve transforming the lat/longs into a geometric object, and then merging with a geoJSON file which contains polygons of each NYC neighborhood. I'm interested in this because the lat/longs on their own are not easy to interpret. Grouping them by neighborhood will lead to more interesting analysis.
I also recode neighborhoods as "Other" if their frequency is low (fewer than 500 pickups in this sample).
GeoJSON file source here: https://github.com/HodgesWardElliott/custom-nyc-neighborhoods
# read in the data (the API's default $limit is 1,000, so raise it to 50,000)
df = pd.read_csv('https://data.cityofnewyork.us/resource/2yzn-sicd.csv?$limit=50000')
df = df.dropna(subset=['ratecodeid', 'pickup_latitude', 'pickup_longitude',
                       'dropoff_latitude', 'dropoff_longitude', 'total_amount'])  # drop rows with NAs in any column of interest
df = df[df['pickup_latitude'] != 0]  # drop if pickup_latitude is 0 (data error)
df = df[df['total_amount'] >= 0]     # drop if total amount paid is negative (data error)
df = df.reset_index(drop=True)       # reset index
df.head()
| passenger_count | trip_distance | pickup_longitude | pickup_latitude | ratecodeid | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | fare_amount | extra | mta_tax | tip_amount | tolls_amount | total_amount | vendor_id | pickup_datetime | dropoff_datetime | rate_code | payment_type | imp_surcharge | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.75 | -73.981003 | 40.759609 | 1 | N | -73.991791 | 40.746620 | 11.0 | 0.5 | 0.5 | 0.00 | 0.00 | 12.30 | 2 | 2015-08-21T23:14:17.000 | 2015-08-21T23:30:20.000 | NaN | 2 | 0.3 |
| 1 | 2 | 1.55 | -73.991776 | 40.750240 | 1 | N | -73.981514 | 40.762455 | 7.5 | 0.0 | 0.5 | 0.00 | 0.00 | 8.30 | 2 | 2015-08-01T10:59:47.000 | 2015-08-01T11:08:12.000 | NaN | 2 | 0.3 |
| 2 | 1 | 15.10 | -73.983566 | 40.749882 | 2 | N | -73.804558 | 40.649757 | 52.0 | 0.0 | 0.5 | 11.65 | 5.54 | 69.99 | 1 | 2015-08-17T14:55:42.000 | 2015-08-17T15:34:57.000 | NaN | 1 | 0.3 |
| 3 | 1 | 2.90 | -74.000458 | 40.727295 | 1 | N | -73.979805 | 40.761723 | 10.5 | 0.5 | 0.5 | 0.00 | 0.00 | 11.80 | 1 | 2015-08-08T02:51:51.000 | 2015-08-08T03:01:14.000 | NaN | 2 | 0.3 |
| 4 | 1 | 15.90 | -73.972786 | 40.749996 | 2 | N | -73.786385 | 40.644257 | 52.0 | 0.0 | 0.5 | 11.67 | 5.54 | 70.01 | 2 | 2015-08-30T08:15:38.000 | 2015-08-30T08:34:47.000 | NaN | 1 | 0.3 |
# read in geoJSON file of NYC neighborhoods
url = 'https://raw.githubusercontent.com/HodgesWardElliott/custom-nyc-neighborhoods/refs/heads/master/custom-pedia-cities-nyc-Mar2018.geojson'
neighborhoods = gpd.read_file(url)
# need to turn lat/lon in original df to a geometric object
# create geometry column in df from latitude and longitude
df['geometry'] = df.apply(lambda row: Point(row['pickup_longitude'], row['pickup_latitude']), axis=1)
# convert df to a geodf
gdf = gpd.GeoDataFrame(df, geometry='geometry')
# make sure both geodfs are at the same coordinate reference system
gdf.set_crs(epsg=4326, inplace=True)
neighborhoods = neighborhoods.to_crs(epsg=4326)
# spatial join the geometric points from the original dataframe with the neighborhoods file
# now each coordinate is mapped to a neighborhood
joined_gdf = gpd.sjoin(gdf, neighborhoods, how='left', predicate='within')
joined_gdf.head()
| passenger_count | trip_distance | pickup_longitude | pickup_latitude | ratecodeid | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | fare_amount | extra | ... | dropoff_datetime | rate_code | payment_type | imp_surcharge | geometry | index_right | neighborhood | boroughCode | borough | X.id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.75 | -73.981003 | 40.759609 | 1 | N | -73.991791 | 40.746620 | 11.0 | 0.5 | ... | 2015-08-21T23:30:20.000 | NaN | 2 | 0.3 | POINT (-73.98100 40.75961) | 278.0 | Theater District | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... |
| 1 | 2 | 1.55 | -73.991776 | 40.750240 | 1 | N | -73.981514 | 40.762455 | 7.5 | 0.0 | ... | 2015-08-01T11:08:12.000 | NaN | 2 | 0.3 | POINT (-73.99178 40.75024) | 51.0 | Chelsea | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... |
| 2 | 1 | 15.10 | -73.983566 | 40.749882 | 2 | N | -73.804558 | 40.649757 | 52.0 | 0.0 | ... | 2015-08-17T15:34:57.000 | NaN | 1 | 0.3 | POINT (-73.98357 40.74988) | 187.0 | Midtown | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... |
| 3 | 1 | 2.90 | -74.000458 | 40.727295 | 1 | N | -73.979805 | 40.761723 | 10.5 | 0.5 | ... | 2015-08-08T03:01:14.000 | NaN | 2 | 0.3 | POINT (-74.00046 40.72729) | 126.0 | Greenwich Village | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... |
| 4 | 1 | 15.90 | -73.972786 | 40.749996 | 2 | N | -73.786385 | 40.644257 | 52.0 | 0.0 | ... | 2015-08-30T08:34:47.000 | NaN | 1 | 0.3 | POINT (-73.97279 40.75000) | 197.0 | Murray Hill | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... |
5 rows × 26 columns
joined_gdf.neighborhood.value_counts()
neighborhood
Midtown            9084
Upper East Side    6064
Chelsea            4859
Upper West Side    3836
Hell's Kitchen     2655
...
Borough Park          1
Arverne               1
Morrisania            1
Jamaica Estates       1
Rockaway Park         1
Name: count, Length: 122, dtype: int64
# regroup neighborhoods with low frequency into "Other"
joined_gdf = joined_gdf.dropna(subset=['neighborhood'])
freq = joined_gdf.neighborhood.value_counts()
# neighborhoods that appear fewer than 500 times become "Other"
neighbor2 = joined_gdf.neighborhood.where(
    joined_gdf.neighborhood.map(freq) >= 500, 'Other')
joined_gdf['neighbor_recode'] = neighbor2
joined_gdf.head()
| passenger_count | trip_distance | pickup_longitude | pickup_latitude | ratecodeid | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | fare_amount | extra | ... | rate_code | payment_type | imp_surcharge | geometry | index_right | neighborhood | boroughCode | borough | X.id | neighbor_recode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.75 | -73.981003 | 40.759609 | 1 | N | -73.991791 | 40.746620 | 11.0 | 0.5 | ... | NaN | 2 | 0.3 | POINT (-73.98100 40.75961) | 278.0 | Theater District | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... | Theater District |
| 1 | 2 | 1.55 | -73.991776 | 40.750240 | 1 | N | -73.981514 | 40.762455 | 7.5 | 0.0 | ... | NaN | 2 | 0.3 | POINT (-73.99178 40.75024) | 51.0 | Chelsea | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... | Chelsea |
| 2 | 1 | 15.10 | -73.983566 | 40.749882 | 2 | N | -73.804558 | 40.649757 | 52.0 | 0.0 | ... | NaN | 1 | 0.3 | POINT (-73.98357 40.74988) | 187.0 | Midtown | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... | Midtown |
| 3 | 1 | 2.90 | -74.000458 | 40.727295 | 1 | N | -73.979805 | 40.761723 | 10.5 | 0.5 | ... | NaN | 2 | 0.3 | POINT (-74.00046 40.72729) | 126.0 | Greenwich Village | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... | Greenwich Village |
| 4 | 1 | 15.90 | -73.972786 | 40.749996 | 2 | N | -73.786385 | 40.644257 | 52.0 | 0.0 | ... | NaN | 1 | 0.3 | POINT (-73.97279 40.75000) | 197.0 | Murray Hill | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... | Murray Hill |
5 rows × 27 columns
# for readability, also shorten some of the neighborhoods
joined_gdf['neighbor_recode'] = joined_gdf['neighbor_recode'].replace(
{'John F. Kennedy International Airport': 'JFK',
'LaGuardia Airport': 'LGA',
'Stuyvesant Town': 'Stuy Town'})
fig = px.histogram(joined_gdf, x = 'neighbor_recode',
height = 600)
fig.update_xaxes(categoryorder = 'total descending')
fig.show()
Observations: Recoding the pickup lat/longs into neighborhoods makes the data far easier to interpret, collapsing a pair of continuous variables into roughly two dozen categories.
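As a sanity check on this kind of threshold recode, here is a small self-contained sketch of the same logic (the neighborhood counts here are made up for illustration) using pandas' vectorized `map`/`where` instead of a loop:

```python
import pandas as pd

# toy series: two frequent neighborhoods and three rare ones (made-up counts)
s = pd.Series(['Midtown'] * 600 + ['Chelsea'] * 550
              + ['Arverne', 'Gravesend', 'Morrisania'])
freq = s.value_counts()

# anything appearing fewer than 500 times collapses into "Other"
recoded = s.where(s.map(freq) >= 500, 'Other')
print(s.nunique(), '->', recoded.nunique())  # 5 -> 3
```

The three rare neighborhoods merge into a single "Other" bucket while the frequent ones keep their names.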
For my second recode, I'll bin total amount paid into the categories below. The bins are right-closed, so each boundary value falls into the lower tier (a $20.00 ride is Low, a $20.01 ride is Moderate):
- Low: $0 - $20
- Moderate: $20 - $50
- High: $50 - $100
- Most Expensive: $100+
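Since `pd.cut` uses right-closed intervals by default, boundary values land in the lower tier. A quick toy check (the dollar amounts are chosen for illustration):

```python
import pandas as pd

amounts = pd.Series([5.0, 20.0, 20.01, 50.0, 100.0, 450.31])
tiers = pd.cut(amounts,
               bins=[0, 20, 50, 100, 500],
               labels=['Low Cost', 'Moderate Cost', 'High Cost', 'Most Expensive'])
# right=True (the default) makes the bins (0, 20], (20, 50], (50, 100], (100, 500],
# so exactly $20.00 is Low Cost while $20.01 is Moderate Cost
print(tiers.tolist())
```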
joined_gdf.total_amount.describe()
count    49254.000000
mean        16.134297
std         13.381670
min          0.000000
25%          8.750000
50%         11.800000
75%         17.800000
max        450.310000
Name: total_amount, dtype: float64
paid_labels = ['Low Cost', 'Moderate Cost', 'High Cost', 'Most Expensive']
joined_gdf['cost_tier'] = pd.cut(joined_gdf.total_amount,
bins = [0, 20, 50, 100, 500],
labels = paid_labels,
right = True, ordered = True)
# look at data via a box plot
box = px.box(joined_gdf, x = 'cost_tier', y = 'total_amount',
height = 600, width = 800)
box.show()
Observations: The box plot looks a little strange because of the variability in the data. Most rides in the Most Expensive tier are under $200, which is where its quartiles are marked, but there are many outliers above that category's upper fence.
Question 2: Use one (or both) of your recoded variables to do a cross-tabulation. Explain your results.¶
I'm interested in cross-tabulating both of our recoded variables, neighborhood and cost tier.
pd.crosstab(joined_gdf.cost_tier, joined_gdf.neighbor_recode, normalize='columns').round(4)*100
| neighbor_recode | Battery Park City | Central Park | Chelsea | East Village | Financial District | Flatiron District | Gramercy | Greenwich Village | Hell's Kitchen | JFK | ... | Lower East Side | Midtown | Murray Hill | Other | SoHo | Theater District | Tribeca | Upper East Side | Upper West Side | West Village |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cost_tier | |||||||||||||||||||||
| Low Cost | 61.6 | 88.51 | 86.38 | 83.76 | 57.33 | 87.04 | 84.91 | 85.34 | 86.52 | 6.43 | ... | 75.89 | 85.59 | 85.85 | 76.33 | 82.21 | 82.57 | 76.80 | 86.38 | 87.10 | 85.00 |
| Moderate Cost | 35.4 | 10.00 | 11.87 | 14.84 | 39.50 | 11.74 | 13.49 | 13.32 | 11.86 | 26.25 | ... | 22.49 | 12.12 | 11.77 | 20.97 | 16.68 | 14.02 | 21.51 | 12.63 | 11.60 | 13.88 |
| High Cost | 2.6 | 1.34 | 1.61 | 1.29 | 3.07 | 1.10 | 1.60 | 1.34 | 1.51 | 66.49 | ... | 1.61 | 2.20 | 2.09 | 2.48 | 1.11 | 3.32 | 1.68 | 0.94 | 1.28 | 1.02 |
| Most Expensive | 0.4 | 0.15 | 0.14 | 0.10 | 0.10 | 0.12 | 0.00 | 0.00 | 0.11 | 0.83 | ... | 0.00 | 0.09 | 0.29 | 0.21 | 0.00 | 0.10 | 0.00 | 0.05 | 0.03 | 0.10 |
4 rows × 22 columns
# turn crosstab into new df so that we can keep the normalized values
cross = pd.crosstab(joined_gdf.cost_tier, joined_gdf.neighbor_recode, normalize='columns').round(4)*100
# pull values for each color in each bar
low = list(cross.loc[cross.index == 'Low Cost'].values.flatten())
moderate = list(cross.loc[cross.index == 'Moderate Cost'].values.flatten())
high = list(cross.loc[cross.index == 'High Cost'].values.flatten())
expensive = list(cross.loc[cross.index == 'Most Expensive'].values.flatten())
import plotly.graph_objects as go
# create bar chart
# add a bar for each cost category
# fill in the values
fig = go.Figure(data=[
go.Bar(name = 'Low Cost', x=list(cross.columns), y=low),
go.Bar(name = 'Moderate Cost', x=list(cross.columns), y=moderate),
go.Bar(name = 'High Cost', x=list(cross.columns), y=high),
go.Bar(name = 'Most Expensive', x=list(cross.columns), y=expensive)
])
# change the bar mode
fig.update_layout(barmode='stack', width = 800, height = 600)
fig.show()
Observations: The cross-tabulation shows the distribution of price tiers within each neighborhood (each column is normalized to sum to 100%). As expected, most rides fall into the low cost tier. The two that deviate are the airports, JFK and LGA. In 2015, yellow cab rides between JFK and Manhattan had a flat fare of $52 (before tolls, tip, and surcharges), so it makes sense that most rides beginning at JFK fall into the high cost tier ($50 - $100). LGA is interesting because it arguably has the weakest public transit support of the three airports in the NY metro area. I personally always Uber/taxi home from LGA but will take the train home from JFK, so it anecdotally makes sense that most rides beginning at LGA fall into the moderate cost tier.
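To make the column normalization explicit, here is a tiny made-up example of `normalize='columns'`: each column (neighborhood) sums to 1, so each cell is a proportion within that neighborhood rather than a share of all rides:

```python
import pandas as pd

# toy data: neighborhood "A" has 3 rides, "B" has 2 (made-up values)
toy = pd.DataFrame({'tier': ['Low', 'Low', 'High', 'Low', 'High'],
                    'hood': ['A', 'A', 'A', 'B', 'B']})
ct = pd.crosstab(toy.tier, toy.hood, normalize='columns')
print(ct)
# hood A: 2/3 Low, 1/3 High; hood B: 1/2 each; every column sums to 1.0
```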
Question 3: Run a correlation of one variable with another variable; make all of the recodes necessary to make the correlation as easy to interpret as possible; and explain your results.¶
I'm interested in the correlation between trip distance and tip amount. While it makes sense that longer trips cost more, the tip is at the discretion of the passenger, so it will be interesting to see whether there are any noticeable ceilings.
round(joined_gdf['trip_distance'].corr(joined_gdf['tip_amount']), 3)
0.468
round(joined_gdf['trip_distance'].corr(joined_gdf['total_amount']), 3)
0.904
# visualize the data
scatter = px.scatter(joined_gdf.sample(n=500, random_state = 1), # for readability, I plot a random sample of 500 data points
x = 'trip_distance',
y = 'tip_amount',
height = 600,
width = 800)
scatter.show()
Observations: The correlation between trip_distance and tip_amount is 0.468. Out of curiosity, I also looked at the correlation between trip_distance and total_amount as a comparison; the latter is far stronger at 0.904.
Tip amounts are not standardized, so the relationship is weaker. There is also a large number of rides with a $0 tip, which certainly weakens the linear relationship.
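One way to see the effect of those $0 tips is a small simulation (entirely made-up numbers, not the taxi data): among riders who tip, the tip is proportional to the fare, but a chunk of rides tip nothing, which drags the overall Pearson correlation down:

```python
import numpy as np

rng = np.random.default_rng(1)
dist = rng.uniform(0.5, 20.0, 1000)      # simulated trip distances (miles)
fare = 2.5 + 2.5 * dist                  # toy linear fare model
tips = np.where(rng.random(1000) < 0.4,  # ~40% of rides tip $0
                0.0, 0.18 * fare)        # the rest tip 18% of the fare

r_all = np.corrcoef(dist, tips)[0, 1]
r_tippers = np.corrcoef(dist[tips > 0], tips[tips > 0])[0, 1]
print(round(r_all, 3), round(r_tippers, 3))  # r_tippers is ~1.0, r_all is lower
```

Restricting to rides with a recorded tip recovers the underlying linear relationship, which suggests the zero-inflated tips (not passenger stinginess alone) are a big part of the weak correlation.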
Question 4: Identify the most extreme cases on some variable. Interpret the results.¶
Let's look at the most expensive and least expensive rides.
# update settings to view all columns
pd.set_option('display.max_columns', None)
# new df with reset index after sort
gdf = joined_gdf.sort_values(by = 'total_amount', ascending = False).reset_index(drop = True)
# view top 3 most expensive rides
gdf.head(3)
| passenger_count | trip_distance | pickup_longitude | pickup_latitude | ratecodeid | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | fare_amount | extra | mta_tax | tip_amount | tolls_amount | total_amount | vendor_id | pickup_datetime | dropoff_datetime | rate_code | payment_type | imp_surcharge | geometry | index_right | neighborhood | boroughCode | borough | X.id | neighbor_recode | cost_tier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00 | -73.977936 | 40.593708 | 5 | N | -73.977936 | 40.593708 | 0.01 | 0.0 | 0.0 | 0.0 | 450.00 | 450.31 | 2 | 2015-08-10T01:02:28.000 | 2015-08-10T01:02:54.000 | NaN | 1 | 0.3 | POINT (-73.97794 40.59371) | 121.0 | Gravesend | 3 | Brooklyn | http://nyc.pediacities.com/Resource/Neighborho... | Other | Most Expensive |
| 1 | 1 | 27.10 | -74.164146 | 40.618874 | 1 | N | -73.950554 | 40.808949 | 78.00 | 0.0 | 0.5 | 266.0 | 0.00 | 344.80 | 2 | 2015-08-22T12:40:25.000 | 2015-08-22T13:53:19.000 | NaN | 1 | 0.3 | POINT (-74.16415 40.61887) | 42.0 | Bull's Head | 5 | Staten Island | http://nyc.pediacities.com/Resource/Neighborho... | Other | Most Expensive |
| 2 | 1 | 19.27 | -73.975105 | 40.765198 | 1 | N | -74.072388 | 40.902721 | 145.50 | 0.0 | 0.5 | 92.0 | 11.75 | 250.05 | 2 | 2015-08-26T10:21:59.000 | 2015-08-26T14:05:38.000 | NaN | 1 | 0.3 | POINT (-73.97511 40.76520) | 49.0 | Central Park | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... | Central Park | Most Expensive |
# view the 3 least expensive rides
gdf.tail(3)
| passenger_count | trip_distance | pickup_longitude | pickup_latitude | ratecodeid | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | fare_amount | extra | mta_tax | tip_amount | tolls_amount | total_amount | vendor_id | pickup_datetime | dropoff_datetime | rate_code | payment_type | imp_surcharge | geometry | index_right | neighborhood | boroughCode | borough | X.id | neighbor_recode | cost_tier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49251 | 1 | 0.0 | -73.950691 | 40.742889 | 5 | N | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 2015-08-15T22:45:04.000 | 2015-08-15T22:45:04.000 | NaN | 2 | 0.0 | POINT (-73.95069 40.74289) | 174.0 | Long Island City | 4 | Queens | http://nyc.pediacities.com/Resource/Neighborho... | Other | NaN |
| 49252 | 1 | 0.0 | -74.001907 | 40.734409 | 5 | N | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 2015-08-14T02:30:29.000 | 2015-08-14T02:30:29.000 | NaN | 2 | 0.0 | POINT (-74.00191 40.73441) | 297.0 | West Village | 1 | Manhattan | http://nyc.pediacities.com/Resource/Neighborho... | West Village | NaN |
| 49253 | 2 | 0.0 | -73.936996 | 40.764622 | 1 | N | -73.936813 | 40.764709 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 2015-08-04T10:00:59.000 | 2015-08-04T10:02:51.000 | NaN | 2 | 0.0 | POINT (-73.93700 40.76462) | 174.0 | Long Island City | 4 | Queens | http://nyc.pediacities.com/Resource/Neighborho... | Other | NaN |
Observations: Looking at the extremes of total_amount reveals what are likely data errors. For instance, in the most expensive ride, trip_distance is 0 but tolls_amount is $450. In the bottom three rides, dropoff_longitude and dropoff_latitude are both 0, which does not make any sense.
I took a closer look at the codebook, and there is a variable called payment_type that indicates how the passenger paid for the ride. In the future, I might use this column to filter out potential errors, perhaps focusing only on credit card payments, since those transactions are recorded automatically.
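A minimal sketch of that kind of cleanup filter, run on a small made-up frame with the same column names (the specific conditions are my assumptions, not rules from the codebook):

```python
import pandas as pd

# toy rides mimicking the errors seen above (values are made up)
rides = pd.DataFrame({
    'trip_distance':     [1.75,   0.00,  15.10,  0.00],
    'total_amount':      [12.30, 450.31, 69.99,  0.00],
    'dropoff_longitude': [-73.99, -73.98, -73.80, 0.00],
    'payment_type':      [2, 1, 1, 2],
})

# keep rides with a positive distance and amount and a non-zero dropoff coordinate
mask = ((rides['trip_distance'] > 0)
        & (rides['total_amount'] > 0)
        & (rides['dropoff_longitude'] != 0))
clean = rides[mask]
print(len(clean))  # 2
```

The same boolean-mask pattern would drop the zero-distance $450 ride and the all-zero rides flagged above.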